import pandas as pd
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import arviz as az
import seaborn as sns
np.random.seed(42)
# Configuration
N_CASES = 50
N_JURORS = 15
JURY_SIZES = [3, 5, 7, 10, 15]
BLOCK_ID = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
N_BLOCKS = 3
The Legibility Trap: Contemptible Familiarity
Decision making in the modern corporation faces a paradox. On the surface, a smooth-running machine optimized for predictable outcomes. In practice, a careful balance between genuine risk assessment and the familiar processes that make investors feel confident and executives feel in control. But when familiar process becomes sacred process, something breaks: reasonable bets give way to bankable certainty, and accountability dissolves into alignment.
To the investment analyst and the private equity partner, “unfamiliarity” is unquantifiable risk. They demand a recognizable org chart, a standardized “Agile” workflow, and a set of “Core Values” that could be swapped between a pet-food startup and a sovereign wealth fund without anyone noticing. This is the cult of Legibility. Borrowing from James C. Scott’s Seeing Like a State, we see that when central authorities cannot understand a complex, organic system, they flatten it. They replace the wild, high-information “forest” of human talent with a “plantation” of identical, predictable units.
When you optimize primarily for the gaze of the outsider or conformity with precedent, you risk dulling your workforce’s most valuable asset: their ability to see things differently. In our quest for “alignment” and adherence to legible standards, we can inadvertently destroy the only thing that makes a group smarter than an individual: the generative friction of our differences.
What makes this pattern so difficult to resist is that it isn’t purely imposed from above. We participate willingly; it is what Thi Nguyen calls “Value Capture”. In the rough ground of reality it’s genuinely hard to gauge quality, but in the planned organization we have metrics: ‘Alignment,’ ‘Velocity,’ and ‘Team Spirit’. We trade the rich but murky value of truth-seeking for the thin but legible value of metric-meeting. We do so eagerly—not because we’re foolish, but because legibility confers real benefits: clearer communication, easier justification, visible contribution. We adopt the value of the corporation as our own, and so short-circuit collective learning.
But the “miracle” of collective intelligence is a generative process that requires diversity. It requires that we be wrong in different directions so that, in the aggregate, we might be right. By smoothing out the “noise” of individual culture and opinion, we break the statistical engine that makes democracy and decentralization work, the engine of collective intelligence: the Condorcet Jury Theorem.
Do as I say, Not what I do.
Investors are among the world’s most fervent believers in the power of diversification. They know that to survive a volatile market, they must hedge their bets across uncorrelated assets. Yet while the investor hedges in the aggregate, they need each subordinate company to appear as a single, comparable unit on a spreadsheet. They abstract away the details of any individual firm, eliminating the local variance in their portfolio view. They want each firm to be “aligned,” “standardized,” and “legible.” In doing so, they incentivise a surface homogeneity that can propagate inward. The firms in their portfolio stop asking, ‘Is this the right decision?’ and start asking, ‘Does this decision look aligned to the investor’s model?’ or “Are our peers doing the same?” The macro consequences of such mimicry are still being felt today, many years after the financial crisis of 2008.
The Condorcet Jury Theorem (CJT) is the mathematical foundation of this drive for diversification. It shows that if you have a group of independent actors who are minimally competent (i.e. each has probability \(p > 0.5\) of being correct on a binary decision), then the probability of the majority being correct approaches 100% as the group size grows. There is real value in diversity when deployed well. But it’s brittle, and many corporate “best practices”, however well-intentioned, can inadvertently undermine it.
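As a quick numerical illustration of the theorem, the sketch below computes the exact majority accuracy for independent voters who all share one competence \(p\). The helper name `majority_accuracy` and the chosen values of `p` and `n` are illustrative assumptions, not part of the models fitted later in the post.

```python
from math import comb

def majority_accuracy(p, n):
    """Exact P(strict majority of n independent voters is correct); n should be odd."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Majority accuracy for a modestly competent voter (p = 0.6) at growing group sizes
for n in [3, 7, 15, 101]:
    print(n, round(majority_accuracy(0.6, n), 3))
```

Re-running the loop with a value of `p` below 0.5 shows the same mechanism working in reverse.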
In this post, we move beyond the idealized “spreadsheet” view of decision-making to understand why modern organizations often fail in lockstep. We will:
- Revisit the Condorcet Miracle: Define the mathematical preconditions for collective intelligence and identify where corporate “best practices” begin to subvert them.
- Simulate Institutional Gravity: Use a hierarchical Bayesian model to isolate how individual skill, systemic case difficulty, and organizational “blocks” compete for influence over the final vote.
- Run the Post-Mortem: Use Posterior Predictive Checks (PPC) to visualize how standardized cultures create an “accuracy ceiling” that remains stagnant even as the group size grows.
- Bridge Theory to Philosophy: Connect the statistical reality of correlated error to James C. Scott’s “Legibility” and C. Thi Nguyen’s “Value Capture”.
Setup
To understand how legibility short-circuits learning, we need a way to build a “company” from the ground up and stress-test its decision-making. Our simulation function, simulate_jury_data, is the generative heart of this piece. It doesn’t just create random numbers; it creates a world where we can toggle the parameters of Generative Friction.
Data Generation and the Ground Truth
In a healthy system, learning happens in the gaps between different perspectives. But in a “legible” system, those gaps are closed. Our simulation allows us to model this across three dimensions: average competence, the spread of individual skill, and block membership.
def simulate_jury_data(n_cases, n_jurors, true_p=0.65, true_discrimination=0.5,
                       block_id=None, true_sigma_block=1.2):
    """
    Simulate jury voting data with optional block effects.

    Parameters:
    -----------
    n_cases : int
        Number of cases to judge
    n_jurors : int
        Number of jurors
    true_p : float
        Average competence (probability of correct vote)
    true_discrimination : float
        Standard deviation of competence in logit space
    block_id : array, optional
        Group membership for each juror (enables faction effects)
    true_sigma_block : float
        Standard deviation of block effects

    Returns:
    --------
    votes : (n_cases, n_jurors) array
        Binary voting matrix
    p_jurors : (n_jurors,) array
        True competence of each juror
    true_states : (n_cases,) array
        Ground truth for each case
    """
    true_states = np.random.binomial(1, 0.5, n_cases)
    # Simulate heterogeneous competencies in logit space
    logit_p_jurors = np.random.normal(
        np.log(true_p / (1 - true_p)),
        true_discrimination,
        n_jurors
    )
    # Add block effects if specified
    if block_id is not None:
        n_blocks = len(np.unique(block_id))
        block_effect = np.random.normal(1, true_sigma_block, n_blocks)
        logit_p_jurors += block_effect[block_id]
    p_jurors = 1 / (1 + np.exp(-logit_p_jurors))
    # Generate votes
    votes = np.zeros((n_cases, n_jurors))
    for i in range(n_cases):
        for j in range(n_jurors):
            prob = p_jurors[j] if true_states[i] == 1 else 1 - p_jurors[j]
            votes[i, j] = np.random.binomial(1, prob)
    return votes, p_jurors, true_states
# Generate our first dataset: simple case without blocks
votes, p_jurors, true_states = simulate_jury_data(N_CASES, N_JURORS)
print(f"Data simulated: {N_CASES} cases, {N_JURORS} jurors")
print(f"True average competence: {p_jurors.mean():.3f}")
majority = (votes.mean(axis=1) > 0.5)
print(f"Majority vote accuracy: {np.mean(majority == true_states):.3f}")
Data simulated: 50 cases, 15 jurors
True average competence: 0.633
Majority vote accuracy: 0.840
Here is how our data-generating process encodes a baseline degree of juror competence. On any yes/no question, each juror’s chance of getting the right answer is drawn in logit space, \(\text{logit}(p_j) \sim N(\text{logit}(0.65), \sigma_{disc})\), so the typical juror is right about 65% of the time. The \(\sigma_{disc}\) parameter determines the range of individual skill in the population. We then ask each juror to cast a vote on 50 decisions.
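To make the logit-space draw concrete, the following sketch repeats the competence step of `simulate_jury_data` (with no blocks) for a large hypothetical population. The seed and population size are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
true_p, sigma_disc = 0.65, 0.5   # the defaults of simulate_jury_data

# Draw competence on the logit scale, then map back to a probability
logit_p = rng.normal(np.log(true_p / (1 - true_p)), sigma_disc, size=100_000)
p_draws = 1 / (1 + np.exp(-logit_p))

print(f"median competence: {np.median(p_draws):.3f}")
print(f"share of jurors worse than a coin flip: {(p_draws < 0.5).mean():.3f}")
```

The second line of output is a reminder that, even with a competent typical juror, a nonzero spread in logit space leaves some individuals below the coin-flip threshold.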
Sensitivity to Prior Beliefs About Competence
In this model, we treat every employee as an interchangeable unit. We assume there are no factions, no “weird” outliers, and no cases that are harder than others. More important still is the requirement of conditional independence. No two people are “thinking alike” because of their shared training or culture; they are only “aligned” because they are all seeing the same Truth. This model (Model 1) defines the idealized world of the Condorcet Jury Theorem. Our first sensitivity analysis asks: how much do our conclusions depend on our prior beliefs about juror competence?
# Define prior specifications
prior_specs = {
    'weakly_informative': {
        'alpha': 3, 'beta': 2,
        'desc': 'Weakly informative (centered at 0.6)'
    },
    'strong_competence': {
        'alpha': 10, 'beta': 3,
        'desc': 'Strong prior (p ~ 0.76)'
    },
    'barely_competent': {
        'alpha': 6, 'beta': 5,
        'desc': 'Skeptical prior (p ~ 0.55)'
    },
    'incompetent': {
        'alpha': 5, 'beta': 10,
        'desc': 'Incompetent prior (p ~ 0.33)'
    },
}
To perform our sensitivity analysis, we test four different “Corporate Climates.” These represent the level of faith an organization has in its own internal competence. When a corporate decision leads to disaster, the post-mortem almost always focuses on competence. The diagnosis is usually that the “wrong people were in the room” or that the team lacked “domain expertise.” This focus on individual skill is a convenient fiction for the investor class; it suggests that the organizational structure is fine, it just needs better “units.”
We’ll see that there are legitimate reasons to worry about individual incompetence in collective decision making. But you need a critical mass of incompetence before it produces consequential error. In most cases, marginal incompetence within an organisation does not break the corporate mechanism. As long as those errors are independent (as long as the “incompetent” people are failing in their own unique, “weird” ways), the majority will still find the truth. The high-performers provide the signal, and the low-performers provide “white noise” that cancels itself out.
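The “critical mass” idea can be sketched with an exact calculation for independent jurors of unequal skill. The pool below is hypothetical: weak jurors are assumed to be right 40% of the time and everyone else 70% of the time, and the helper `majority_accuracy_mixed` is an illustrative name rather than part of the post’s models.

```python
import numpy as np

def majority_accuracy_mixed(p_jurors):
    """Exact P(majority correct) for independent jurors with differing competence."""
    dist = np.array([1.0])  # dist[k] = P(k correct votes among the jurors added so far)
    for p in p_jurors:
        # Add one juror: either they are wrong (count stays) or right (count + 1)
        dist = np.append(dist, 0.0) * (1 - p) + np.append(0.0, dist) * p
    n = len(p_jurors)
    return dist[n // 2 + 1:].sum()

n_jurors = 15
for n_weak in [0, 3, 6, 9, 15]:
    # Hypothetical pool: weak jurors at 0.4, everyone else at 0.7
    pool = [0.4] * n_weak + [0.7] * (n_jurors - n_weak)
    print(n_weak, round(majority_accuracy_mixed(pool), 3))
```

Because errors are independent here, a few weak jurors lower accuracy only gradually; the majority fails once they dominate the pool.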
Model 1: The Classical Condorcet Jury Model
Let \(T_i \in \{0,1\}\) denote the true state of case \(i = 1,\dots,N\), with \[ T_i \sim \text{Bernoulli}(0.5). \]
Each juror \(j = 1,\dots,J\) casts a binary vote \(V_{ij} \in \{0,1\}\). Conditioned on the truth, all jurors share a common probability of voting correctly: \[ \Pr(V_{ij} = T_i \mid p) = p, \qquad p > \tfrac{1}{2}. \]
Equivalently, the likelihood may be written as: \[ V_{ij} \mid T_i, p \sim \begin{cases} \text{Bernoulli}(p) & \text{if } T_i = 1, \\ \text{Bernoulli}(1-p) & \text{if } T_i = 0. \end{cases} \]
This model imposes three strong assumptions:
- Exchangeability across jurors: all jurors are equally competent.
- Exchangeability across cases: all cases are equally difficult.
- Conditional independence: \[ V_{ij} \perp V_{ik} \mid T_i, p. \]
These assumptions define the idealized world in which the Condorcet Jury Theorem applies.
def fit_base_condorcet_model(votes, prior_spec, n_cases=N_CASES):
    """
    Fit basic Condorcet model with specified prior on competence.

    This model assumes:
    - All jurors have identical competence p
    - Votes are conditionally independent given the truth
    - Equal prior probability for guilty/not guilty
    """
    with pm.Model() as model:
        # SENSITIVITY PARAMETER: Prior on competence
        p = pm.Beta('p', alpha=prior_spec['alpha'], beta=prior_spec['beta'])
        # Latent true state of each case
        true_state = pm.Bernoulli('true_state', p=0.5, shape=n_cases)
        # Voting probability depends on true state
        vote_prob = pm.Deterministic('vote_prob', pm.math.switch(
            pm.math.eq(true_state[:, None], 1), p, 1 - p
        ))
        pm.Bernoulli('votes', p=vote_prob, observed=votes)
        # Posterior predictive: majority vote accuracy for different jury sizes
        for size in [3, 7, 15]:
            votes_sim = pm.Bernoulli(f'sim_votes_{size}', p=p, shape=size)
            pm.Deterministic(
                f'majority_correct_{size}',
                pm.math.sum(votes_sim) > size / 2
            )
        # Sample
        idata = pm.sample_prior_predictive()
        idata.extend(pm.sample(2000, tune=1000, random_seed=42,
                               target_accept=0.95, return_inferencedata=True))
        idata.extend(pm.sample_posterior_predictive(idata))
    return idata, model
The key property of the model we want to probe is how the accuracy of the majority vote evolves as we add jurors. The stressor we apply is the prior on population competence, \[ p \sim \text{Beta}(\alpha, \beta), \] and we gauge how majority accuracy changes across its settings.
No assumptions about independence or exchangeability are altered. Instead, this model makes explicit the epistemic commitment that the Condorcet theorem leaves implicit: jurors are assumed, a priori, to be better than random. You must hire well enough that the average employee has a better than coin-flip chance of being right. Under the conditions of the model, this modest requirement is all that is needed to ensure collective accuracy.
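Before fitting anything, it can help to see what a Beta prior alone implies for majority accuracy. The sketch below draws \(p\) from two of the priors in `prior_specs` and averages the exact binomial majority probability over those draws. It is a plain Monte Carlo approximation and does not use PyMC; the helper name `binomial_majority`, the seed, and the number of draws are assumptions.

```python
import numpy as np
from math import comb

def binomial_majority(p, n):
    """Exact P(majority of n independent voters is correct), vectorised over p."""
    p = np.asarray(p, dtype=float)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

rng = np.random.default_rng(0)  # illustrative seed
priors = {"weakly_informative": (3, 2), "incompetent": (5, 10)}  # (alpha, beta) from prior_specs

for name, (a, b) in priors.items():
    p_draws = rng.beta(a, b, size=200_000)
    for n in [3, 7, 15]:
        print(name, n, round(binomial_majority(p_draws, n).mean(), 3))
```

The direction of the trend with jury size depends on where the prior puts its mass relative to 0.5, which is the sensitivity the four prior specifications are designed to expose.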
# Fit under all prior specifications
traces = {}
for prior_name, spec in prior_specs.items():
    print(f"\nFitting with {spec['desc']}...")
    idata, model = fit_base_condorcet_model(votes, spec)
    traces[prior_name] = idata
    traces[prior_name + '_model'] = model
Let’s examine how our prior beliefs influence prediction. We have sampled the majority votes under both our prior and the derived posterior distribution. We see how the accuracy of the majority vote grows with more voters when the prior on group competence is high.
def extract_estimates(traces, prior_specs, jury_sizes=(3, 7, 15), stage='prior'):
    """Extract majority accuracy estimates from traces."""
    ests = {}
    for prior_name in prior_specs.keys():
        estimates = []
        for size in jury_sizes:
            p = traces[prior_name][stage][f'majority_correct_{size}'].mean().item()
            estimates.append(p)
        ests[prior_name] = estimates
    return pd.DataFrame(
        ests,
        index=[f'Correct % for Majority of {s}' for s in jury_sizes]
    )
# Compare prior and posterior estimates
prior_estimates = extract_estimates(traces, prior_specs, stage='prior')
posterior_estimates = extract_estimates(traces, prior_specs, stage='posterior')
print("\n" + "="*70)
print("PRIOR ESTIMATES")
print("="*70)
print(prior_estimates)
print("\n" + "="*70)
print("POSTERIOR ESTIMATES (AFTER SEEING DATA)")
print("="*70)
print(posterior_estimates)
======================================================================
PRIOR ESTIMATES
======================================================================
weakly_informative strong_competence \
Correct % for Majority of 3 0.614 0.848
Correct % for Majority of 7 0.616 0.904
Correct % for Majority of 15 0.652 0.956
barely_competent incompetent
Correct % for Majority of 3 0.534 0.268
Correct % for Majority of 7 0.556 0.204
Correct % for Majority of 15 0.610 0.158
======================================================================
POSTERIOR ESTIMATES (AFTER SEEING DATA)
======================================================================
weakly_informative strong_competence \
Correct % for Majority of 3 0.5380 0.620250
Correct % for Majority of 7 0.5485 0.667125
Correct % for Majority of 15 0.5620 0.727750
barely_competent incompetent
Correct % for Majority of 3 0.475250 0.381875
Correct % for Majority of 7 0.452125 0.330125
Correct % for Majority of 15 0.444750 0.271125
However, we can also see that aggregating incompetent opinion drives the majority towards the wrong answer.
# Visualize the shift from prior to posterior
fig, axs = plt.subplots(1, 2, figsize=(20, 5))
for prior_name in prior_specs.keys():
    axs[0].plot(prior_estimates.index, prior_estimates[prior_name],
                label=prior_name, marker='o')
    axs[1].plot(posterior_estimates.index, posterior_estimates[prior_name],
                label=prior_name, marker='o')
axs[0].legend()
axs[1].legend()
axs[0].set_title("Prior Beliefs About Majority Accuracy")
axs[1].set_title("Posterior Beliefs (After Observing Data)")
axs[0].set_ylabel("Probability of Correct Majority Decision")
axs[1].set_ylabel("Probability of Correct Majority Decision")
plt.tight_layout()
plt.show()
You cannot avoid the requirement of minimum competence in the jury pool if you hope for collective wisdom. But more strikingly, when we have even modest competence, the aggregation of votes quickly converges towards high accuracy. This is exactly the finding of Condorcet’s Jury Theorem.
Let a group of \(n\) independent individuals (a “jury”) be tasked with choosing between two outcomes, one of which is “correct.” Let \(p\) represent the probability of any single individual making the correct choice. The theorem consists of two parts:
- The Competence Requirement: If \(p > 0.5\), then the probability that a majority of the group makes the correct choice is greater than \(p\).
- The Asymptotic Result: As \(n \to \infty\), the probability that the majority choice is correct approaches \(1\).
Conversely, if \(p < 0.5\), increasing the size of the jury only increases the probability that the group will arrive at the wrong conclusion, with that probability approaching \(1\) as \(n\) grows.
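A worked case makes the competence requirement tangible. Assuming odd \(n\) (so there are no ties) and independent voters with a shared \(p\), the majority probability is a binomial tail sum, and for \(n = 3\) the comparison with \(p\) factorises neatly:

```latex
\[
P_n(p) = \sum_{k=(n+1)/2}^{n} \binom{n}{k} p^k (1-p)^{n-k}, \qquad n \text{ odd}
\]
\[
P_3(p) = p^3 + 3p^2(1-p) = 3p^2 - 2p^3
\]
\[
P_3(p) - p = p(1-p)(2p-1) > 0 \iff p > \tfrac{1}{2} \qquad (0 < p < 1)
\]
\[
P_3(0.6) = 3(0.36) - 2(0.216) = 0.648
\]
```

The same factor \((2p - 1)\) changes sign at \(p = \tfrac{1}{2}\), which is why a three-person majority is worse than a single voter when \(p < 0.5\).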
This is a theorem. We’re not showing that it is incorrect, but we are stress-testing it in finite settings to gauge its applicability to practical decision making.
This mathematical law is the ‘efficiency’ that the investor class is buying. You might look at this theorem and see a guarantee: as long as we hire ‘competent’ people (\(p > 0.5\)) and align them, we can’t lose. But the theorem has a hidden poison pill. It relies entirely on independent errors, and we know that competence isn’t uniform. This pulls us in two ways: (1) management panic over hiring standards and (2) fostering diversity in the employee base. We’ll look at (1) next.
Individual Differences in Competence
The base Condorcet model assumes all jurors are identically competent. In reality, people vary in expertise, attention, and judgment. Let’s model heterogeneity in juror competence. To gauge these effects we’ll use a hierarchical model where individual competencies are drawn from a population distribution. The key sensitivity parameter is \(\sigma\) (discrimination): how much do jurors differ?
Heterogeneous Juror Competence
We now relax the assumption that all jurors are equally competent. Each juror \(j\) is assigned an individual probability of voting correctly: \[ \text{logit}(p_j) = \mu + \sigma z_j, \qquad z_j \sim \mathcal{N}(0,1). \]
For a fixed juror \(j\), define the number of agreements with the majority: \[ A_j = \sum_{i=1}^N \mathbb{1}\{V_{ij} = \text{majority}_i\}. \]
Under the assumption that cases are exchangeable and votes are conditionally independent given \(p_j\), we obtain the exact likelihood:
\[ A_j \mid p_j \sim \text{Binomial}(N, p_j). \]
This is not an approximation. It is the marginal likelihood obtained by integrating over \(N\) independent Bernoulli trials: \[ \prod_{i=1}^N \text{Bernoulli}(V_{ij} \mid p_j) \;\Longrightarrow\; \text{Binomial}(A_j \mid N, p_j). \]
This observation motivates a re-articulation of the Condorcet model. The Binomial phrasing is often easier to sample than the Bernoulli likelihood, so we shall switch to that model. Our next model uses a Binomial likelihood and relies on two assumptions: we hold fixed the exchangeability across cases, and we rely on the sufficiency of the count statistic. Once jurors are treated as stable measurement instruments, their entire voting history becomes a single aggregated observation.
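The sufficiency claim can be checked numerically. In the sketch below, the vote-by-vote Bernoulli log-likelihood and the Binomial log-likelihood of the count differ only by \(\log \binom{N}{A}\), which does not depend on \(p\). The agreement record is a made-up example for a single hypothetical juror.

```python
from math import comb, log

votes_agree = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical agreement record for one juror
N, A = len(votes_agree), sum(votes_agree)

def bernoulli_loglik(p):
    """Log-likelihood of the full vote-by-vote record."""
    return sum(log(p) if v == 1 else log(1 - p) for v in votes_agree)

def binomial_loglik(p):
    """Log-likelihood of the agreement count alone."""
    return log(comb(N, A)) + A * log(p) + (N - A) * log(1 - p)

# The gap between the two is log C(N, A): constant in p, so inference on p is unchanged
for p in [0.3, 0.6, 0.9]:
    print(p, round(binomial_loglik(p) - bernoulli_loglik(p), 6))
```

Since the gap is the same at every value of `p`, the two likelihoods lead to the same posterior for a juror’s competence.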
Model 2: Varying Competence
def fit_hierarchical_model(votes, n_jurors, discrimination_prior):
    """
    Fit hierarchical model with individual variation in competence.

    Model structure:
    - μ: population mean competence (in logit space)
    - σ: population standard deviation (SENSITIVITY PARAMETER)
    - Each juror has individual competence drawn from N(μ, σ)

    We use non-centered parameterization for better sampling.
    """
    majority_votes = (votes.mean(axis=1) > 0.5).astype(int)
    agreements_per_juror = np.array([
        (votes[:, j] == majority_votes).sum() for j in range(n_jurors)
    ])
    with pm.Model() as model:
        # Population-level parameters
        mu_logit_p = pm.Normal('mu_logit_p', mu=0.6, sigma=0.5)
        # KEY SENSITIVITY PARAMETER: individual discrimination
        sigma_logit_p = pm.HalfNormal(
            'sigma_logit_p',
            sigma=discrimination_prior['sigma']
        )
        # Non-centered parameterization: logit_p = μ + σ * z
        z_juror = pm.Normal('z_juror', mu=0, sigma=1, shape=n_jurors)
        logit_p_juror = pm.Deterministic(
            'logit_p_juror',
            mu_logit_p + sigma_logit_p * z_juror
        )
        p_juror = pm.Deterministic('p_juror', pm.math.invlogit(logit_p_juror))
        # Collapsed likelihood: count agreements with majority
        # (the number of trials is the number of cases in the vote matrix)
        pm.Binomial('agreements', n=votes.shape[0], p=p_juror,
                    observed=agreements_per_juror)
        idata = pm.sample_prior_predictive()
        idata.extend(pm.sample(1000, tune=2000, random_seed=42,
                               target_accept=0.95, return_inferencedata=True,
                               idata_kwargs={"log_likelihood": True}))
        idata.extend(pm.sample_posterior_predictive(idata))
    return idata, model
Again, we’ll test a range of priors. But this time we’ll push on the range of permissible competence in the jury pool.
# Test three levels of discrimination
discrimination_priors = {
    'weak_discrimination': {
        'sigma': 0.5,
        'desc': 'Weak discrimination (σ ~ 0.5)'
    },
    'moderate_discrimination': {
        'sigma': 1.0,
        'desc': 'Moderate discrimination (σ ~ 1)'
    },
    'strong_discrimination': {
        'sigma': 2.0,
        'desc': 'Strong discrimination (σ ~ 2)'
    },
}
traces_discrimination = {}
for prior_name, spec in discrimination_priors.items():
    print(f"\nFitting with {spec['desc']}...")
    idata, model = fit_hierarchical_model(votes, N_JURORS, spec)
    traces_discrimination[prior_name] = idata
    traces_discrimination[prior_name + '_model'] = model
Whereas in the previous model we observed each and every vote in the likelihood, here we’re abstracting away from the particular and assuming the realised values are a consequence of latent structure: individual, estimable competence.
# Examine one of the fitted models
ax = az.plot_forest(traces_discrimination['strong_discrimination'], var_names=['p_juror'], combined=True)
ax[0].set_title("Diversity of Jury Competence \n Under Strong Discrimination");
By collapsing these observations into a Binomial distribution, we focus the model’s attention on the success rate of the collective. Paradoxically, by “throwing away” the individual case observations of the jurors, we force the model to account for the variance in the process itself.
The Generative Process and Implied Votes
For this reason we now rely on posterior predictive sampling to reconstruct these votes: we are no longer just fitting a line to a set of points; we are simulating a generative process. To understand the implications of different discrimination levels, we need to simulate complete jury deliberations. In other words, we need to translate each individual voter’s profile into votes. \[ V_{ij} \mid T_i, p_j \sim \begin{cases} \text{Bernoulli}(p_j) & \text{if } T_i = 1, \\ \text{Bernoulli}(1-p_j) & \text{if } T_i = 0. \end{cases} \]
This will then allow us to test for accuracy of the majority under different ranges of skill. The core insight is that we must forward sample to derive the voting profile of each individual.
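The forward-sampling step can also be written without Python loops. The sketch below is a vectorised variant under the same likelihood: a juror is correct with probability \(p_j\), and a correct vote copies the truth while an incorrect one flips it. The function name `simulate_votes_vectorized`, the explicit `rng` argument, and the example competences are assumptions for illustration.

```python
import numpy as np

def simulate_votes_vectorized(p_juror, truth, rng):
    """Forward-sample a (n_cases, n_jurors) vote matrix without Python loops."""
    p_juror = np.asarray(p_juror, dtype=float)
    truth = np.asarray(truth, dtype=int)
    # Each juror is right on each case with probability p_j, independently
    correct = rng.random((truth.size, p_juror.size)) < p_juror
    # A correct vote copies the truth; an incorrect vote flips it
    return np.where(correct, truth[:, None], 1 - truth[:, None])

rng = np.random.default_rng(0)  # illustrative seed
truth = rng.integers(0, 2, size=2_000)
votes_demo = simulate_votes_vectorized([0.9, 0.6, 0.55], truth, rng)
print((votes_demo == truth[:, None]).mean(axis=0))  # per-juror hit rate, close to p_juror
```

The looped version is easier to read against the likelihood; a vectorised form mainly matters when the simulation is repeated for every posterior draw.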
def simulate_votes_from_competence(p_juror, n_cases, truth=None):
    """Generate votes given juror competencies and ground truth."""
    n_jurors = len(p_juror)
    if truth is None:
        truth = np.random.binomial(1, 0.5, size=n_cases)
    votes = np.zeros((n_cases, n_jurors), dtype=int)
    for i in range(n_cases):
        for j in range(n_jurors):
            prob = p_juror[j] if truth[i] == 1 else 1 - p_juror[j]
            votes[i, j] = np.random.binomial(1, prob)
    return truth, votes
To do so we’ll define a number of helper functions below.
def compute_diagnostics(votes, truth):
    """Compute suite of diagnostic metrics for jury performance."""
    majority = votes.mean(axis=1) > 0.5
    diagnostics = {
        'majority_accuracy': np.mean(majority == truth),
        'unanimity_rate': np.mean(
            (votes.sum(axis=1) == 0) | (votes.sum(axis=1) == votes.shape[1])
        ),
        'juror_agreement': np.mean(votes == truth[:, None], axis=0),
    }
    # Error correlation: do jurors make mistakes together?
    errors = votes != truth[:, None]
    if errors.var(axis=0).sum() > 0:
        diagnostics['error_corr'] = np.corrcoef(errors.T)
    else:
        diagnostics['error_corr'] = np.zeros((votes.shape[1], votes.shape[1]))
    return diagnostics
def majority_accuracy_by_size(votes, truth, jury_size):
    """Calculate accuracy for random sub-juries of given size."""
    n_cases, n_jurors = votes.shape
    correct = np.zeros(n_cases, dtype=int)
    for i in range(n_cases):
        jurors = np.random.choice(n_jurors, size=jury_size, replace=False)
        majority = votes[i, jurors].mean() > 0.5
        correct[i] = (majority == truth[i])
    return correct.mean()
def run_ppc_analysis(idata, n_cases, truth, jury_sizes=JURY_SIZES):
    """Run comprehensive posterior predictive checks."""
    p_juror_samples = (idata.posterior['p_juror']
                       .stack(sample=("chain", "draw")).values)
    n_jurors, n_samples = p_juror_samples.shape
    results = {
        'majority_acc': np.zeros(n_samples),
        'unanimity': np.zeros(n_samples),
        'error_corr': np.zeros((n_samples, n_jurors, n_jurors)),
        'accuracy_by_size': {k: np.zeros(n_samples) for k in jury_sizes}
    }
    for s in range(n_samples):
        _, votes = simulate_votes_from_competence(
            p_juror_samples[:, s], n_cases, truth
        )
        diag = compute_diagnostics(votes, truth)
        results['majority_acc'][s] = diag['majority_accuracy']
        results['unanimity'][s] = diag['unanimity_rate']
        results['error_corr'][s] = diag['error_corr']
        for k in jury_sizes:
            results['accuracy_by_size'][k][s] = (
                majority_accuracy_by_size(votes, truth, k)
            )
    return results
def summarize_ppc(ppc_results, jury_sizes=JURY_SIZES):
    """Create summary DataFrame from PPC results."""
    percentiles = [5, 50, 95]
    summaries = []
    for k in jury_sizes:
        summaries.append(np.percentile(
            ppc_results['accuracy_by_size'][k], percentiles
        ))
    df = pd.DataFrame(summaries).T
    df.columns = [f'majority_accuracy_{k}' for k in jury_sizes]
    df.index = [f'percentile_{p}' for p in percentiles]
    return df
def compare_prior_posterior(idata, n_cases, truth, jury_sizes=JURY_SIZES):
    """Compare prior and posterior predictive distributions."""
    results = {}
    for stage in ['prior', 'posterior']:
        p_samples = (getattr(idata, stage)['p_juror']
                     .stack(sample=("chain", "draw")).values)
        n_jurors, n_samples = p_samples.shape
        # Simplified PPC for comparison
        stage_results = {k: np.zeros(n_samples) for k in jury_sizes}
        for s in range(n_samples):
            _, votes = simulate_votes_from_competence(
                p_samples[:, s], n_cases, truth
            )
            for k in jury_sizes:
                stage_results[k][s] = majority_accuracy_by_size(votes, truth, k)
        results[stage] = summarize_ppc({'accuracy_by_size': stage_results},
                                       jury_sizes)
    return pd.concat(results, names=['stage', 'percentile'])
def plot_prior_posterior_comparison(df, title="Majority Accuracy"):
    """Plot prior vs posterior distributions."""
    x_values = JURY_SIZES
    fig, ax = plt.subplots(figsize=(10, 6))
    for stage, color in [('prior', 'blue'), ('posterior', 'red')]:
        median = df.loc[(stage, 'percentile_50')]
        low = df.loc[(stage, 'percentile_5')]
        high = df.loc[(stage, 'percentile_95')]
        ax.plot(x_values, median, label=f'{stage.title()} Median',
                color=color, marker='o')
        ax.fill_between(x_values, low, high, color=color, alpha=0.2,
                        label=f'{stage.title()} (5th-95th)')
    ax.set_title(title)
    ax.set_xlabel('Number of Jurors')
    ax.set_ylabel('Majority Accuracy')
    ax.set_xticks(x_values)
    ax.legend()
    ax.grid(True, linestyle='--', alpha=0.6)
    plt.tight_layout()
    return fig
Now let’s apply this framework. We’ll focus on the moderate discrimination case as it’s most realistic:
# Analyze the moderate discrimination case
print(f"\n{'='*70}")
print(f"Analysis: {discrimination_priors['moderate_discrimination']['desc']}")
print('='*70)
comparison = compare_prior_posterior(
traces_discrimination['moderate_discrimination'],
N_CASES,
true_states
)
comparison
======================================================================
Analysis: Moderate discrimination (σ ~ 1)
======================================================================
| stage | percentile | majority_accuracy_3 | majority_accuracy_5 | majority_accuracy_7 | majority_accuracy_10 | majority_accuracy_15 |
|---|---|---|---|---|---|---|
| prior | percentile_5 | 0.419 | 0.40 | 0.34 | 0.38 | 0.34 |
| prior | percentile_50 | 0.700 | 0.74 | 0.78 | 0.82 | 0.88 |
| prior | percentile_95 | 0.920 | 0.96 | 0.98 | 1.00 | 1.00 |
| posterior | percentile_5 | 0.580 | 0.62 | 0.66 | 0.70 | 0.74 |
| posterior | percentile_50 | 0.700 | 0.74 | 0.78 | 0.82 | 0.86 |
| posterior | percentile_95 | 0.800 | 0.84 | 0.88 | 0.90 | 0.94 |
We still see convergence to the truth as we scale up the size of our majority, even when we allow heterogeneous levels of skill in the voting population. The median prior and posterior estimates of accuracy are quite close, but we’ve significantly shrunk the uncertainty in our estimate.
plot_prior_posterior_comparison(
comparison,
title="Prior vs Posterior: Moderate Individual Variation"
);
The pattern is consistent across all discrimination levels: the data updates our beliefs, and larger juries show higher accuracy. The key question is whether errors remain independent.
Error Correlation Analysis
A critical assumption of Condorcet is independence: jurors make errors independently. Let’s check this for our moderate discrimination model:
def plot_error_correlation_heatmap(ppc_results, title="Error Correlation"):
    """Plot mean error correlation matrix, handling NaN values properly."""
    all_corrs = ppc_results['error_corr']  # (n_samples, n_jurors, n_jurors)
    # Use nanmean to properly average across samples, ignoring NaNs
    mean_corr = np.nanmean(all_corrs, axis=0)
    # For cells that are still NaN (all samples were NaN), replace with 0
    mean_corr = np.nan_to_num(mean_corr, nan=0.0)
    mean_corr = np.round(mean_corr, 2)
    fig, ax = plt.subplots(figsize=(15, 6))
    sns.heatmap(mean_corr, vmin=-.3, vmax=.3, cmap="coolwarm",
                square=False, cbar_kws={"label": "Error correlation"}, ax=ax,
                annot=True, fmt=".2f")
    ax.set_title(title, fontsize=20)
    ax.set_xlabel("Voter Index")
    ax.set_ylabel("Voter Index")
    plt.tight_layout()
    return fig
def summarize_error_correlation(ppc_results):
    """Extract summary statistics from error correlation matrices."""
    corr = ppc_results['error_corr']
    n = corr.shape[1]
    off_diag = []
    for s in range(corr.shape[0]):
        mat = corr[s]
        # Extract upper triangle (excluding diagonal)
        upper_tri = mat[np.triu_indices(n, k=1)]
        # Only include non-NaN values
        valid_values = upper_tri[~np.isnan(upper_tri)]
        if len(valid_values) > 0:
            off_diag.extend(valid_values)
    off_diag = np.array(off_diag)
    if len(off_diag) == 0:
        return {
            'mean_off_diag': np.nan,
            'sd_off_diag': np.nan,
            'p95_abs_corr': np.nan,
        }
    return {
        'mean_off_diag': off_diag.mean(),
        'sd_off_diag': off_diag.std(),
        'p95_abs_corr': np.percentile(np.abs(off_diag), 95),
    }
ppc_moderate = run_ppc_analysis(
traces_discrimination['moderate_discrimination'],
N_CASES,
true_states
)
fig = plot_error_correlation_heatmap(
ppc_moderate,
title="Error Correlation Among Voters: \n Moderate Individual Skill Difference"
);
fig.savefig("Independent_voter_errors_correlation.png")
print("\nError Correlation Summary:")
summarize_error_correlation(ppc_moderate)
Error Correlation Summary:
{'mean_off_diag': np.float64(4.1269912183272494e-05),
'sd_off_diag': np.float64(0.14293965923889015),
'p95_abs_corr': np.float64(0.2792611296287719)}
Key insight: With heterogeneous competence alone, errors remain largely uncorrelated. The Condorcet theorem’s independence assumption holds—so far. Which is to say that diverse workforces and varying degrees of competence do not inherently short-circuit the firm’s ability to learn.
From Aggregated to Generative: Why We Need Item Response Theory
The binomial model served us well for exploring individual differences because we only cared whether jurors on average agreed with majorities. But to explore shared case difficulty—where hard cases cause everyone to struggle simultaneously—we need to model each vote explicitly.
Enter Item Response Theory (IRT). Instead of collapsing votes into agreement counts, we now model: \[V_{ij} \mid T_i, p_j, \delta_i\]
This lets us ask: when case i is hard (large |δ_i|), do errors cluster across jurors? The binomial model can’t answer this because it averages over cases. The IRT model can, and the answer will show us where the Condorcet theorem truly breaks down.
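A minimal sketch of what the vote-level view buys us, assuming the parametrisation P(correct) = sigmoid(theta_j - delta_i) in which a larger delta_i means a harder case. The ability and difficulty values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_jurors = 200, 15

theta = rng.normal(1.0, 0.5, size=n_jurors)   # juror ability (assumed values)
delta = rng.normal(0.0, 1.0, size=n_cases)    # case difficulty (assumed values)

# P(juror j is correct on case i) = sigmoid(theta_j - delta_i)
p_correct = 1.0 / (1.0 + np.exp(-(theta[None, :] - delta[:, None])))
correct = rng.random((n_cases, n_jurors)) < p_correct  # (cases, jurors)

# What a collapsed binomial model sees: one count per juror
correct_per_juror = correct.sum(axis=0)

# What the vote-level view adds: the share of jurors who err on each case
error_rate_per_case = 1.0 - correct.mean(axis=1)
difficulty_error_corr = np.corrcoef(delta, error_rate_per_case)[0, 1]
print(f"corr(difficulty, case error rate) = {difficulty_error_corr:.2f}")
```

The per-juror counts discard the case dimension entirely, whereas the full vote matrix lets us see errors piling up on the hard cases.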
Voting Blocks and Improvement Programmes
Now we introduce the forces that will determine whether an organization can sustain collective wisdom or succumbs to coordinated failure: block effects (\(\beta\)) and treatment programmes (\(\tau\)). These structural additions represent opposing gravitational forces in a delicate equilibrium. Pull too hard in one direction and you get chaos; pull too hard in the other and you get lockstep failure.
The Mathematical Structure: Competing Forces
We model each juror’s vote as a negotiation between three forces:
\[ \text{logit}\,\Pr(V_{ij} = T_i) = \underbrace{\alpha_j}_{\text{individual skill}} + \underbrace{\beta_{b(j)}}_{\text{block gravity}} + \underbrace{\delta_i}_{\text{case difficulty}} - \underbrace{\tau_j \cdot Z_j}_{\text{treatment push}} \]
Where:
- \(\alpha_j\): Individual competence (heterogeneous across jurors)
- \(\beta_{b(j)}\): Block effect for juror j’s group (shared within blocks, creating correlation)
- \(\delta_i\): Case difficulty (shared across all jurors for case i)
- \(\tau_j \cdot Z_j\): Treatment effect (\(Z_j = 1\) if juror j receives intervention, 0 otherwise)
The sign on the treatment term is critical: it subtracts from the combined logit. This means treatment is pushing against the combined force of individual tendency + block conformity + case difficulty.
Think of it this way: Without treatment, a juror’s probability of voting correctly is determined by their skill, modified by their block’s shared bias, and further modified by how hard the case is. Treatment attempts to “decouple” them from their block’s default position—to make them think independently even when their block would pull them in a particular direction.
If treatment works, \(\tau\) should be positive: it reduces the influence of the \((\alpha + \beta + \delta)\) package and forces the juror to reconsider from first principles.
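To make the sign convention concrete, here is a small worked calculation of the logit above with and without treatment. The parameter values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def p_correct(alpha, beta, delta, tau, z):
    """Probability of a correct vote under logit = alpha + beta + delta - tau * z."""
    logit = alpha + beta + delta - tau * z
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values, chosen only for illustration
alpha, beta, delta, tau = 0.4, 0.8, -0.2, 0.5

untreated = p_correct(alpha, beta, delta, tau, z=0)  # logit = 1.0
treated = p_correct(alpha, beta, delta, tau, z=1)    # logit = 0.5
print(f"untreated: {untreated:.3f}, treated: {treated:.3f}")
```

With a positive \(\tau\), the treated juror's logit moves from 1.0 to 0.5: the combined skill, block and difficulty package has less say over the vote.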
Block effects can arise from legitimate sources: engineers possess genuine domain expertise that others lack, sales teams have unique customer insights, executives have access to strategic information. These blocks may differ in average competence, and those differences can be entirely justified.
But for the Condorcet Jury Theorem, what matters is not whether one group is more competent than another—it’s whether their errors are independent. A block of highly skilled specialists who share the same blind spots creates correlated errors that violate the theorem’s assumptions. Even if Block A is objectively more accurate than Block B on average, if all members of Block A fail together on hard cases, the wisdom of crowds collapses.
This analysis explores these failure conditions: the point at which shared frameworks, training, or information access cause errors to cluster rather than cancel. The statistical machinery cannot distinguish “good blocks” from “bad blocks”—it can only detect correlation. And correlation, regardless of its source, is what breaks collective intelligence.
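The cost of correlated errors can be sketched with a toy simulation. Each juror is right with the same marginal probability in both scenarios; the only difference is that, with probability `rho`, a juror defers to a judgement shared by their block. The copying mechanism, the value of `rho` and the function name `majority_accuracy` are assumptions for illustration rather than the generative model used in this post.

```python
import numpy as np

def majority_accuracy(block_id, p=0.7, rho=0.0, n_cases=20000, seed=0):
    """Share of cases where a strict majority of jurors is correct."""
    rng = np.random.default_rng(seed)
    n_jurors = len(block_id)
    n_blocks = block_id.max() + 1
    own = rng.random((n_cases, n_jurors)) < p       # each juror's private judgement
    shared = rng.random((n_cases, n_blocks)) < p    # one shared judgement per block
    copy = rng.random((n_cases, n_jurors)) < rho    # juror defers to the block
    # Marginal accuracy stays at p: rho * p + (1 - rho) * p
    correct = np.where(copy, shared[:, block_id], own)
    return (correct.sum(axis=1) > n_jurors / 2).mean()

block_id = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
acc_independent = majority_accuracy(block_id, rho=0.0)
acc_blocked = majority_accuracy(block_id, rho=0.8)
print(f"independent: {acc_independent:.3f}, blocked: {acc_blocked:.3f}")
```

No individual became less competent between the two runs. The majority gets worse purely because fifteen votes now carry roughly three blocks' worth of information.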
The Anatomy of Blocks: Epistemic Consolidation and Value Capture
In real organizations, people don’t arrive at decisions in isolation. They cluster into groups that share:
- Functional training: All the engineers learned the same architecture patterns; all the MBAs learned the same strategy frameworks
- Information access: The sales team sees customer feedback the finance team never sees; executives receive filtered summaries while front-line workers see raw reality
- Incentive structures: Different departments optimize for different metrics, creating systematically different biases
- Social networks: People who talk to each other regularly start thinking alike, even without realizing it
These patterns promote a consolidation of epistemological standards. Members of a block don’t just share surface characteristics; they share ways of knowing and hence ways of being wrong. We saw that case difficulty created correlation across the entire jury for specific cases. Block effects, by contrast, create persistent correlation within subgroups across all cases, which is structurally worse. When evidence is ambiguous and cases are difficult, people fall back on their frameworks, and if everyone shares frameworks, everyone fails together. In this way case difficulty and group-think interact to drive an organisation towards overly simplified but easily defensible heuristics.
“Value capture happens when a person or group adopts an externally-sourced value as their own, without adapting it to their particular context… In value capture, we outsource the process of value deliberation. And, as with other forms of outsourcing, there is a trade-off…When we adopt those values, we gain access to readymade methods for justification” - Nguyen
The adoption of an external standard or metric can confer legitimacy, but it also tends to favour an abstraction that ignores relevant detail. We lose focus on local context as we make our goals legible beyond it. “We did it because everyone else was doing it” works until it doesn’t.
The Legibility Trap: Redux
Recall that investors demand “alignment” and “standardization”. They want unified KPI systems, “culture fit” hiring, best-practices adoption and common tooling. These forces push organisations towards a consolidation of shared incentives, methods and cognitive practices. Not only can corporate values supplant your individual perspective, they do so at scale. Every legibility-seeking practice is a block-creating mechanism. By making the organization more “readable” to outsiders, you create the very correlations that break collective intelligence.
The Treatment Paradox
There is also a force pulling in the opposite direction. Organisations aren’t static and leaders can intervene. They can run training programs, hire consultants, promote new processes and suggest or mandate resolution frameworks. The question is whether these interventions break group-think or reinforce it. One useful example is the pattern of “red-teaming” in software development. Before a major decision, you randomly assign people from each block to argue against their block’s default position. This intervention creates divergent pressure. It doesn’t eliminate block membership: engineers remain engineers. But it forces block members to engage with perspectives that directly challenge their shared frameworks. The treatment pulls against block effects. Or so goes the theory.
The Complete Model: Modeling the Push and Pull
We’ve now seen two opposing forces at work in organizational decision-making:
The Pull Toward Legibility (Block Effects): Shared training, common frameworks, aligned incentives. These create β-blocks—groups who think alike not because they’re incompetent, but because they’ve been optimized for coherence. This is the gravitational force that makes organizations manageable but epistemically fragile.
The Push Toward Independence (Treatment Effects): Interventions designed to break people out of their default frameworks. Red-teaming exercises, devil’s advocate assignments, cross-functional rotation, deliberately soliciting dissent. These are organizational attempts to restore the independence that the Condorcet theorem requires.
Our final model formalizes this tension. Block effects (β) pull jurors toward correlated errors. Treatment effects (τ) push them back toward independence. The question is whether deliberate intervention can overcome structural gravity—or whether the legibility trap is too strong to escape.
treatment = np.ones(N_JURORS)
with pm.Model() as irt_model:
    # -----------------------------
    # Hyperpriors
    # -----------------------------
    sigma_delta = pm.HalfNormal("sigma_delta", 1.0)
    sigma_block = pm.HalfNormal("sigma_block", 1.0)
    sigma_juror = pm.HalfNormal("sigma_juror", 1.0)
    # Treatment effect (tau), shared across jurors
    tau = pm.Normal("tau", 0.0, 1.0)
    # -----------------------------
    # Case-level difficulty
    # -----------------------------
    delta = pm.Normal(
        "delta",
        mu=0.0,
        sigma=sigma_delta,
        shape=N_CASES
    )
    # -----------------------------
    # Latent ground truth
    # -----------------------------
    z = pm.Bernoulli(
        "z",
        p=0.5,
        shape=N_CASES
    )
    # -----------------------------
    # Block-level effects
    # -----------------------------
    alpha_block = pm.Normal(
        "alpha_block",
        mu=0.0,
        sigma=sigma_block,
        shape=N_BLOCKS
    )
    # -----------------------------
    # Juror-level ability
    # -----------------------------
    eps = pm.Normal(
        "eps",
        mu=0.0,
        sigma=sigma_juror,
        shape=N_JURORS
    )
    theta = pm.Deterministic(
        "theta",
        alpha_block[BLOCK_ID] + tau * treatment + eps
    )
    # -----------------------------
    # IRT likelihood
    # -----------------------------
    # p_ij = logistic(theta_j - delta_i), broadcast to shape (cases, jurors)
    logit_p = theta[None, :] - delta[:, None]
    p = pm.math.sigmoid(logit_p)
    # Probability of observed vote
    # If z_i = 1: y_ij ~ Bernoulli(p_ij)
    # If z_i = 0: y_ij ~ Bernoulli(1 - p_ij)
    vote_prob = z[:, None] * p + (1 - z[:, None]) * (1 - p)
    y_obs = pm.Bernoulli(
        "y_obs",
        p=vote_prob,
        observed=votes_blocked
    )
with irt_model:
    idata_irt = pm.sample(
        draws=2000,
        tune=2000,
        chains=4,
        target_accept=0.9
    )
    ppc = pm.sample_posterior_predictive(
        idata_irt,
        var_names=["y_obs"]
    )
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [sigma_delta, sigma_block, sigma_juror, tau, delta, alpha_block, eps]
>BinaryGibbsMetropolis: [z]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 18 seconds.
/Users/nathanielforde/mambaforge/envs/applied-bayesian-regression-modeling-env/lib/python3.13/site-packages/arviz/stats/diagnostics.py:596: RuntimeWarning: invalid value encountered in scalar divide
(between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Sampling: [y_obs]
import numpy as np
# ---------------------------
# 1. Stack posterior predictive & latent truth
# ---------------------------
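# Convert to plain NumPy arrays so later comparisons align by position rather than by xarray dimension name
y_rep = ppc.posterior_predictive["y_obs"].stack(sample=("chain", "draw")).values # (cases, jurors, total_samples)
z_post = idata_irt.posterior["z"].stack(sample=("chain", "draw")).values # (cases, total_samples)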
# ---------------------------
# 2. Parameters
# ---------------------------
total_jurors = y_rep.shape[1] # 15
cases = y_rep.shape[0] # 50
total_samples = y_rep.shape[2] # 8000
n_samples = 1000 # subsample for speed
# Subsample posterior draws
sample_idx = np.random.choice(total_samples, size=n_samples, replace=False)
y_rep_sub = y_rep[:, :, sample_idx] # (cases, jurors, n_samples)
z_post_sub = z_post[:, sample_idx] # (cases, n_samples)
# ---------------------------
# 3. Initialize results
# ---------------------------
jury_sizes = np.arange(1, total_jurors + 1)
accuracy_mean = []
accuracy_hdi_lower = []
accuracy_hdi_upper = []
# ---------------------------
# 4. Loop over jury sizes
# ---------------------------
for k in jury_sizes:
    # Pick k random jurors for each k
    jurors = np.random.choice(total_jurors, size=k, replace=False)
    # Extract votes for selected jurors: (cases, k, n_samples)
    votes = y_rep_sub[:, jurors, :]
    # Majority vote per case per sample
    majority_vote = (votes.sum(axis=1) > k / 2).astype(int) # shape: (cases, n_samples)
    # Compare to latent truth
    correct = (majority_vote == z_post_sub) # (cases, n_samples)
    # Posterior accuracy per sample
    posterior_accuracy = correct.mean(axis=0) # (n_samples,)
    # Summarize posterior with the mean and a central 95% interval
    accuracy_mean.append(posterior_accuracy.mean())
    accuracy_hdi_lower.append(np.percentile(posterior_accuracy, 2.5))
    accuracy_hdi_upper.append(np.percentile(posterior_accuracy, 97.5))
# ---------------------------
# 5. Convert to numpy arrays
# ---------------------------
accuracy_mean = np.array(accuracy_mean)
accuracy_hdi_lower = np.array(accuracy_hdi_lower)
accuracy_hdi_upper = np.array(accuracy_hdi_upper)
# ---------------------------
# 6. Plot posterior majority accuracy vs jury size
# ---------------------------
import matplotlib.pyplot as plt
plt.fill_between(jury_sizes, accuracy_hdi_lower, accuracy_hdi_upper, alpha=0.3, label="95% CI")
plt.plot(jury_sizes, accuracy_mean, marker='o', label="Posterior mean")
plt.xlabel("Number of jurors")
plt.ylabel("Posterior majority accuracy")
plt.title("Accuracy vs Jury Size")
plt.legend()
plt.show()
# Generate new data with block structure
votes_blocked, p_jurors_blocked, true_states_blocked = simulate_jury_data(
N_CASES, N_JURORS, block_id=BLOCK_ID
)
def fit_full_model(votes, n_jurors, block_id, treatment_indicator,
                   use_treatment=0):
    """
    Complete model with four sources of variation:
    1. Individual skill (α_j)
    2. Block/faction effects (β_block)
    3. Case difficulty (δ_case)
    4. Treatment Programme (tau)
    """
    n_cases = votes.shape[0]
    # Collapse votes to one agreement-with-majority count per juror
    majority_votes = (votes.mean(axis=1) > 0.5).astype(int)
    agreements_per_juror = np.array([
        (votes[:, j] == majority_votes).sum() for j in range(n_jurors)
    ])
    with pm.Model() as model:
        # Individual skill (non-centered parameterisation)
        mu_alpha = 0
        sigma_alpha = pm.Exponential("sigma_alpha", lam=3.0)
        alpha_raw = pm.Normal("alpha_raw", 0.0, 1.0, shape=n_jurors)
        alpha_j = pm.Deterministic("alpha_j", mu_alpha + sigma_alpha * alpha_raw)
        # Block effects (ideological factions, info silos)
        n_blocks = len(np.unique(block_id))
        sigma_block = pm.HalfNormal("sigma_block", sigma=1.0)
        block_effect_raw = pm.Normal("block_effect_raw", mu=0.0, sigma=1.0, shape=n_blocks)
        block_effect = pm.Deterministic("block_effect", sigma_block * block_effect_raw)
        beta_block_j = block_effect[block_id]
        # Case difficulty, collapsed over cases to a single average term
        mu_case = pm.Normal("mu_case", mu=0.0, sigma=0.5)
        sigma_case = pm.HalfNormal("sigma_case", sigma=1.0)
        delta_bar = pm.Normal("delta_bar", mu=mu_case, sigma=sigma_case)
        # -----------------------------
        # Treatment effect (switchable)
        # -----------------------------
        tau = pm.Exponential("tau", 3.0, shape=n_jurors)
        # convert to tensors to avoid shape surprises
        Z = pm.math.constant(treatment_indicator)
        s = pm.math.constant(use_treatment)
        treatment_term = pm.Deterministic('trt', s * Z * tau)
        # Combined model: treatment subtracts from the skill + block + difficulty logit
        logit_p_correct = ((alpha_j + beta_block_j + delta_bar) - treatment_term)
        p_correct = pm.Deterministic("p_correct",
                                     pm.math.sigmoid(logit_p_correct))
        # Collapsed likelihood
        pm.Binomial("agreements", n=n_cases, p=p_correct,
                    observed=agreements_per_juror)
        idata = pm.sample(2000, tune=2000, target_accept=0.99,
                          return_inferencedata=True)
    return idata, model
print("\nFitting complete hierarchical model...")
idata_full, model_full = fit_full_model(votes_blocked, N_JURORS, BLOCK_ID, treatment_indicator=np.zeros(N_JURORS), use_treatment=0)
treatment_assignment = np.ones(N_JURORS)
idata_full_trt, model_full_trt = fit_full_model(votes_blocked, N_JURORS, BLOCK_ID, treatment_indicator=treatment_assignment, use_treatment=1)
We fit two versions of the model against the same blocked data:
Model 1: Natural State (Z = 0 for all jurors). This is the organization as it exists: blocks have formed through natural processes (hiring, training, cultural evolution), and no deliberate effort is made to break people out of their silos. All variance in voting patterns must be explained by individual skill, block membership, and case difficulty.
Model 2: Intervention State (Z = 1 for all jurors). This represents an organization that has implemented a comprehensive independence-restoration program. Examples might include:
- Red-teaming: Before voting, randomly assign people to argue against their block’s default position
- Rotation: Temporarily embed engineers in sales, finance in product
- Structured dissent: Require each block to produce an internal critic before voting
- Anonymous voting: Remove social pressure to conform to block consensus
The treatment indicator \(Z_j = 1\) tells the model: “This juror has been subjected to an intervention designed to reduce their dependence on block-default thinking.”
The question: Can such interventions meaningfully push against the gravity of legibility? Or does the correlation persist despite our best efforts?
ax = az.plot_forest([idata_full, idata_full_trt], var_names=["sigma_alpha", "sigma_block", "mu_case", "sigma_case", "trt"], combined=True, model_names=['Block + Without Treatment', 'Block + With Treatment'], figsize=(10, 7))
ax[0].set_title("Comparing Parameter Effects \n With and Without Improvement Plan");
What the Model Tells Us About Organizational Intervention
The forest plot reveals something uncomfortable: when we “turn on” treatment (Z = 1), the model doesn’t conclude that block effects have weakened. Instead, it concludes that block effects must be even stronger than we thought.
Why? Because the voting patterns remain stubbornly correlated despite the presumed intervention. The Bayesian engine reasons backwards: “If treatment is supposed to push people toward independence, but I still see strong correlation in the data, then either (a) treatment doesn’t work, or (b) the underlying block effects are so powerful that even treated jurors remain influenced.”
The model chooses interpretation (b), allocating more variance to σ_block in the treatment model. This is the statistical signature of organizational inertia: interventions that look good on paper (red-teaming, rotation programs, diversity training) fail to break the fundamental correlation structure.
The treatment effect τ remains small or uncertain because the model cannot distinguish “treatment is working a little” from “blocks are just that strong.” This is not a flaw in the model—it’s an honest reflection of the data-generating process. If block effects are structural (shared information, shared incentives, shared training), surface-level interventions may not be enough.
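One way to see why the two stories are hard to separate: when every juror is treated, a stronger block effect paired with a stronger treatment effect can produce exactly the same logit, and therefore the same binomial likelihood. The sketch below uses hypothetical parameter values and a hypothetical agreement count (33 of 50) to show this.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binomial_loglik(k, n, p):
    """Log-likelihood of k agreements out of n cases with success probability p."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1.0 - p)

alpha, delta = 0.3, 0.1  # hypothetical skill and case-difficulty values
weak_blocks = {"beta": 0.5, "tau": 0.2}    # modest block pull, modest treatment
strong_blocks = {"beta": 1.0, "tau": 0.7}  # strong block pull, strong treatment

def loglik(params, k=33, n=50):
    # Z = 1 for every juror, so tau is always subtracted
    logit = alpha + params["beta"] + delta - params["tau"]
    return binomial_loglik(k, n, sigmoid(logit))

ll_weak, ll_strong = loglik(weak_blocks), loglik(strong_blocks)
print(f"weak blocks: {ll_weak:.4f}, strong blocks: {ll_strong:.4f}")
```

In this sketch the data cannot tell the two parameter sets apart, so whatever separation the posterior shows between them has to come from the priors.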
Posterior Predictive Checks for Complete Model
Fitting the model is only half the battle; the PPC allows us to simulate “new” decisions from the inferred organizational structure to see if the “Condorcet Miracle” survives the reality of corporate blocks. In our earlier model we saw majority accuracy march steadily toward \(100\%\) as the jury size grew. Here, that progress hits a hard ceiling.
def run_ppc_full_model(idata, n_cases, true_states, block_id,
                       jury_sizes=JURY_SIZES, n_draws=500):
    """
    PPC for complete model including block and case effects.
    Properly samples from posterior distribution.
    """
    # Extract posterior samples; stacked arrays have the sample axis last
    alpha_j = idata.posterior['alpha_j'].stack(sample=("chain", "draw")).values
    block_effect = idata.posterior['block_effect'].stack(sample=("chain", "draw")).values
    sigma_case = idata.posterior['sigma_case'].stack(sample=("chain", "draw")).values
    mu_case = idata.posterior['mu_case'].stack(sample=("chain", "draw")).values
    trt = idata.posterior['trt'].stack(sample=("chain", "draw")).values
    n_jurors = alpha_j.shape[0]
    total_samples = alpha_j.shape[1]
    # Randomly select n_draws samples
    sample_idx = np.random.choice(total_samples, size=min(n_draws, total_samples),
                                  replace=False)
    n_draws = len(sample_idx)
    results = {
        'majority_acc': np.zeros(n_draws),
        'error_corr': np.zeros((n_draws, n_jurors, n_jurors)),
        'accuracy_by_size': {k: np.zeros(n_draws) for k in jury_sizes}
    }
    rng = np.random.default_rng(42)
    for idx, s in enumerate(sample_idx):
        # Sample case difficulty effects for this posterior draw
        delta_case = rng.normal(mu_case[s], sigma_case[s], size=n_cases)
        # Generate votes for each case
        votes = np.zeros((n_cases, n_jurors), dtype=int)
        for i in range(n_cases):
            truth_i = true_states[i]
            # Skill and block effects push towards the true state
            sign = 1 if truth_i == 1 else -1
            for j in range(n_jurors):
                # Combine individual skill + block effect + case difficulty
                # Note: trt enters with a positive sign here, whereas the fitted model subtracts it
                logit_p = sign * (alpha_j[j, s] + block_effect[block_id[j], s]) + delta_case[i] + trt[j, s]
                p = 1 / (1 + np.exp(-logit_p))
                votes[i, j] = rng.binomial(1, p)
        # Compute diagnostics
        diag = compute_diagnostics(votes, true_states)
        results['majority_acc'][idx] = diag['majority_accuracy']
        results['error_corr'][idx] = diag['error_corr']
        for k in jury_sizes:
            results['accuracy_by_size'][k][idx] = (
                majority_accuracy_by_size(votes, true_states, k)
            )
    return results
The power of the hierarchical model lies in its ability to “partial out” exactly why a juror was wrong. In the vanilla model, an incorrect vote was simply “incompetence.” Here, it is a structural outcome. To derive the vote profile of each individual, we have re-assembled the components we just decomposed. Forward sampling in this way allows us to see how the probability of correctness for any single voter is a negotiation between their own talent and the gravitational forces of the system they inhabit.
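The helper `majority_accuracy_by_size` used above is defined outside this section. As a guess at the kind of computation involved, and not the actual implementation, the sketch below scores a strict majority vote from a random subset of k jurors against the true states. The function name `majority_accuracy_for_subset` and the toy vote matrix are hypothetical.

```python
import numpy as np

def majority_accuracy_for_subset(votes, true_states, k, rng):
    """Accuracy of a strict majority vote among k randomly chosen jurors."""
    n_jurors = votes.shape[1]
    jurors = rng.choice(n_jurors, size=k, replace=False)
    # A case is decided as 1 only when more than half of the k jurors vote 1
    majority = (votes[:, jurors].sum(axis=1) > k / 2).astype(int)
    return (majority == true_states).mean()

rng = np.random.default_rng(0)
true_states = rng.integers(0, 2, size=500)
# Toy votes: each juror independently matches the truth with probability 0.65
match = rng.random((500, 15)) < 0.65
votes = np.where(match, true_states[:, None], 1 - true_states[:, None])

acc = {k: majority_accuracy_for_subset(votes, true_states, k, rng) for k in (3, 7, 15)}
print(acc)
```

Note that with an even k a tie is resolved as a vote for 0 under this rule, which is one reason even jury sizes can behave slightly differently from odd ones.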
print("\nRunning posterior predictive checks for complete model...")
ppc_full = run_ppc_full_model(idata_full, N_CASES, true_states_blocked,
BLOCK_ID, n_draws=500)
ppc_full_trt = run_ppc_full_model(idata_full_trt, N_CASES, true_states_blocked,
BLOCK_ID, n_draws=500)
# Summarize accuracy by jury size
summary_full = summarize_ppc(ppc_full)
print("\n" + "="*70)
print("MAJORITY ACCURACY BY JURY SIZE (Complete Model)")
print("="*70)
summary_full
Running posterior predictive checks for complete model...
/Users/nathanielforde/mambaforge/envs/applied-bayesian-regression-modeling-env/lib/python3.13/site-packages/numpy/lib/_function_base_impl.py:3065: RuntimeWarning: invalid value encountered in divide
c /= stddev[:, None]
/Users/nathanielforde/mambaforge/envs/applied-bayesian-regression-modeling-env/lib/python3.13/site-packages/numpy/lib/_function_base_impl.py:3066: RuntimeWarning: invalid value encountered in divide
c /= stddev[None, :]
======================================================================
MAJORITY ACCURACY BY JURY SIZE (Complete Model)
======================================================================
| | majority_accuracy_3 | majority_accuracy_5 | majority_accuracy_7 | majority_accuracy_10 | majority_accuracy_15 |
|---|---|---|---|---|---|
| percentile_5 | 0.36 | 0.359 | 0.339 | 0.339 | 0.34 |
| percentile_50 | 0.62 | 0.640 | 0.640 | 0.660 | 0.66 |
| percentile_95 | 0.84 | 0.900 | 0.920 | 0.940 | 0.98 |
Notice that increasing the jury size does not substantially improve accuracy: the median majority accuracy moves only from 0.62 with 3 jurors to 0.66 with 15. This is strong evidence of a plateau effect. Adding voters is not adding new information. Unfortunately, the same pattern is visible in our treatment model.
summary_full_trt = summarize_ppc(ppc_full_trt)
print("\n" + "="*70)
print("MAJORITY ACCURACY BY JURY SIZE (Complete Model + Treatment)")
print("="*70)
summary_full_trt
======================================================================
MAJORITY ACCURACY BY JURY SIZE (Complete Model + Treatment)
======================================================================
| | majority_accuracy_3 | majority_accuracy_5 | majority_accuracy_7 | majority_accuracy_10 | majority_accuracy_15 |
|---|---|---|---|---|---|
| percentile_5 | 0.38 | 0.38 | 0.40 | 0.399 | 0.38 |
| percentile_50 | 0.64 | 0.64 | 0.66 | 0.680 | 0.66 |
| percentile_95 | 0.88 | 0.92 | 0.96 | 0.980 | 1.00 |
The allocation of a heterogeneous treatment programme was also not sufficient to break the block voting effects.
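For a sense of what the plateau is being compared against, the classical Condorcet calculation for independent voters can be written down directly. The sketch below assumes every juror is correct with the same probability p and uses a hypothetical p = 0.6; it covers odd jury sizes only, where ties cannot occur.

```python
import math

def condorcet_majority_accuracy(n, p):
    """P(strict majority of n independent jurors is correct), each correct with probability p."""
    threshold = n // 2 + 1
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )

p = 0.6  # hypothetical individual accuracy
sizes = [3, 5, 7, 15]
accs = [condorcet_majority_accuracy(n, p) for n in sizes]
for n, acc in zip(sizes, accs):
    print(f"n={n:2d}: {acc:.3f}")
```

Under independence the accuracy keeps climbing with every added juror, which is exactly the behaviour missing from the tables above.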
# Plot error correlations
plot_error_correlation_heatmap(
ppc_full,
title="Error Correlation: Complete Model with Block & Case Effects"
)
print("\n" + "="*70)
print("ERROR CORRELATION SUMMARY (Complete Model)")
print("="*70)
error_summary = summarize_error_correlation(ppc_full)
for key, value in error_summary.items():
print(f"{key}: {value:.3f}")
======================================================================
ERROR CORRELATION SUMMARY (Complete Model)
======================================================================
mean_off_diag: 0.140
sd_off_diag: 0.173
p95_abs_corr: 0.429
Notice the structured patterns in the error correlation heatmap—jurors within the same block show correlated errors. This is the smoking gun: block effects create dependencies that violate the independence assumption.
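The visual impression from the heatmap can be reduced to two numbers by splitting the off-diagonal correlations into within-block and between-block pairs. The sketch below is one possible way to do that. The helper name `split_corr_by_block` is hypothetical, and the toy matrix is constructed by hand rather than taken from the fitted model.

```python
import numpy as np

def split_corr_by_block(corr, block_id):
    """Mean off-diagonal correlation for same-block and different-block juror pairs."""
    block_id = np.asarray(block_id)
    same_block = block_id[:, None] == block_id[None, :]
    # Keep each pair once by restricting to the strict upper triangle
    upper = np.triu(np.ones_like(same_block, dtype=bool), k=1)
    within = np.nanmean(corr[upper & same_block])
    between = np.nanmean(corr[upper & ~same_block])
    return within, between

# Block layout copied from the configuration at the top of the post
block_id = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2])

# Toy correlation matrix: 0.4 inside blocks, 0.05 across blocks (assumed values)
toy_corr = np.where(block_id[:, None] == block_id[None, :], 0.4, 0.05)
np.fill_diagonal(toy_corr, 1.0)

within, between = split_corr_by_block(toy_corr, block_id)
print(f"within-block: {within:.2f}, between-block: {between:.2f}")
```

Applied to a single draw of the error correlation matrix, a large gap between the two means would be the numerical counterpart of the block pattern in the heatmap.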
# Plot error correlations
plot_error_correlation_heatmap(
ppc_full_trt,
title="Error Correlation: Complete Model with Block & Case Effects"
)
print("\n" + "="*70)
print("ERROR CORRELATION SUMMARY (Complete Model)")
print("="*70)
error_summary = summarize_error_correlation(ppc_full_trt)
for key, value in error_summary.items():
print(f"{key}: {value:.3f}")
======================================================================
ERROR CORRELATION SUMMARY (Complete Model)
======================================================================
mean_off_diag: 0.190
sd_off_diag: 0.183
p95_abs_corr: 0.492
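The correlation patterns change slightly, but they do not weaken: the mean off-diagonal correlation is 0.190 under treatment against 0.140 without it. Correlations of this size continue to reduce the effectiveness of the majority.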
The treatment paradox captures the organizational dilemma in statistical form. Leaders genuinely try to foster independence: they hire for “cognitive diversity,” they implement devil’s advocate protocols, they rotate people across functions. These are the \(\tau\) terms in our model—deliberate pushes against the pull of legibility.
But our posterior estimates suggest that these interventions often fail to overcome structural forces. The \(\beta\) terms (block effects driven by shared training, information, and incentives) dominate the \(\tau\) terms (deliberate interventions). This isn’t because leaders are incompetent or uncommitted—it’s because block formation is structural while treatment is episodic. You can send an engineer to shadow a sales call (\(\tau\)), but they return to an engineering team with engineering metrics, engineering promotion criteria, and engineering social networks (\(\beta\)). The gravitational field reasserts itself.
The model cannot tell us whether stronger interventions would work—whether, for instance, permanently breaking up blocks through forced cross-functional teams would restore independence. What it does tell us is that incremental, well-intentioned programs often aren’t enough to overcome the correlation that legibility creates.
Conclusions
A failure to properly instrumentalize values through appropriate proxy metrics hinders communication, cohesion and cooperation within and between organisations. And yet too crude an instrumentalization of value turns work from a compelling collective action project into a metric movement charade. Worse, it drives us to error and expected failure.
The statistical breakdown of the Condorcet Jury Theorem is not an accident of poor corporate planning; it is the inevitable result of a fundamental tension in cooperative work. We are witnessing a persistent tug-of-war between two opposing gravitational forces:
The Pull for Legibility (Scott): The necessary drive to make the organization understandable to the investor, the analyst, and the leader. This force creates the “blocks,” the standardized workflows, and the shared cultures that allow thousands of people to move in the same direction.
The Push for Agency (Nguyen): The individual’s resistance to Value Capture. This is the drive to maintain a unique, high-fidelity perspective on reality that hasn’t been flattened by a KPI or a “Core Value.”
When we let the pull for Legibility dominate entirely, we risk the “Contemptible Familiarity” of the over-optimized corporation. We get a system that is easy to manage on a spreadsheet or JIRA-board but potentially epistemically compromised. In the worst case: a group of 1,000 people with the collective wisdom of one, because they have all been trained to see the world through the same lens—and thus to share the same blind spots. This is the cardinal sin of “Value Capture”. We have adopted the organization’s simplified map of the world as our own, and in doing so, we have destroyed our ability to help the organization see.
Citation
@online{forde,
author = {Forde, Nathaniel},
title = {The {Condorcet} {Jury} {Theorem} and {Democratic}
{Rationality}},
langid = {en}
}